DizzyRNN: Reparameterizing Recurrent Neural Networks for Norm-Preserving Backpropagation

Authors

  • Victor Dorobantu
  • Per Andre Stromhaug
  • Jess Renteria
Abstract

The vanishing and exploding gradient problems are well-studied obstacles that make it difficult for recurrent neural networks to learn long-term time dependencies. We propose a reparameterization of standard recurrent neural networks to update linear transformations in a provably norm-preserving way through Givens rotations. Additionally, we use the absolute value function as an element-wise non-linearity to preserve the norm of backpropagated signals over the entire network. We show that this reparameterization reduces the number of parameters and maintains the same algorithmic complexity as a standard recurrent neural network, while outperforming standard recurrent neural networks with orthogonal initializations and Long Short-Term Memory networks on the copy problem.

*Order of authors determined by MATLAB randperm.

1. Defining the problem

Recurrent neural networks (RNNs) are trained by updating model parameters through gradient descent with backpropagation to minimize a loss function. However, RNNs in general do not prevent the loss derivative signal from decreasing in magnitude as it propagates through the network. This results in the vanishing gradient problem, where the loss derivative signal becomes too small to update model parameters (Bengio et al., 1994). This hampers training of RNNs, especially for learning long-term dependencies in data.

2. Signal scaling analysis

The prediction of an RNN is the result of a composition of linear transformations, element-wise nonlinearities, and bias additions. To locate the sources of the vanishing and exploding gradient problems in such a network, one can examine the minimum and maximum scaling properties of each transformation independently and then compose the resulting scaling factors.

2.1. Linear transformations

Let y = Ax be an arbitrary linear transformation, where A ∈ ℝ^(m×n) is a matrix of rank r.

Theorem 1. The singular value decomposition (SVD) of A is A = UΣVᵀ, for orthogonal U and V, and diagonal Σ with diagonal elements σ1, …, σn, the singular values of A.

From the SVD, Corollaries 1 and 2 follow.

Corollary 1. Let σmin and σmax be the minimum and maximum singular values of A, respectively. Then σmin‖x‖₂ ≤ ‖y‖₂ ≤ σmax‖x‖₂.

Corollary 2. Let σmin and σmax be the minimum and maximum singular values of A, respectively. Then σmin and σmax are also the minimum and maximum singular values of Aᵀ.

Proofs for these corollaries are deferred to the appendix.

Let L be a scalar function of y. Then

    ∂L/∂x = (∂y/∂x)ᵀ ∂L/∂y = Aᵀ ∂L/∂y.

In an RNN, this relation describes the scaling effect of a linear transformation on the backpropagated signal. By Corollary 2, each linear transformation scales the loss derivative signal by at least the minimum singular value of the corresponding weight matrix and at most by the maximum singular value.

Theorem 2. All singular values of an orthogonal matrix are 1.

By Corollary 2, if the linear transformation A is orthogonal, then the linear transformation does not scale the loss derivative signal.

2.2. Non-linear functions

Let y = f(x) be an arbitrary element-wise non-linear transformation. Let L be a scalar function of y. Then

    ∂L/∂x = (∂y/∂x)ᵀ ∂L/∂y = f′(x) ⊙ ∂L/∂y,

where f′ denotes the first derivative of f and ⊙ denotes the element-wise product. The i-th element of ∂L/∂y is scaled by f′(xᵢ), hence at least by minᵢ f′(xᵢ) and at most by maxᵢ f′(xᵢ).
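To make Corollaries 1 and 2 and Theorem 2 concrete, the following NumPy sketch (not part of the paper; the variable names and sizes are illustrative assumptions) checks the scaling bounds numerically for a random square weight matrix and for a random orthogonal matrix:

    import numpy as np

    rng = np.random.default_rng(0)

    # Square linear transformation y = W x, as for a recurrent weight matrix.
    n = 6
    W = rng.standard_normal((n, n))
    x = rng.standard_normal(n)
    y = W @ x

    # Singular values of W; by Corollary 2, W and W^T share them.
    sigma = np.linalg.svd(W, compute_uv=False)
    s_min, s_max = sigma.min(), sigma.max()

    # Corollary 1 (forward pass): s_min * ||x|| <= ||W x|| <= s_max * ||x||.
    assert s_min * np.linalg.norm(x) <= np.linalg.norm(y) <= s_max * np.linalg.norm(x)

    # Backpropagation multiplies dL/dy by W^T, so the same bounds apply to the
    # loss derivative signal (Corollary 2).
    g = rng.standard_normal(n)          # stands in for dL/dy
    g_back = W.T @ g                    # dL/dx = W^T dL/dy
    assert s_min * np.linalg.norm(g) <= np.linalg.norm(g_back) <= s_max * np.linalg.norm(g)

    # Theorem 2: an orthogonal matrix has all singular values equal to 1 and
    # therefore leaves the norm of the backpropagated signal unchanged.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    print(np.linalg.norm(Q.T @ g), np.linalg.norm(g))   # identical up to rounding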
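Composing these per-step scaling factors over many time steps is what makes gradients vanish or explode. The sketch below (again an illustration under assumed sizes, not the paper's experiment) backpropagates a signal through 100 linear steps: a recurrent matrix whose largest singular value is below 1 shrinks the signal exponentially, while an orthogonal matrix leaves its norm untouched.

    import numpy as np

    rng = np.random.default_rng(3)
    n, T = 32, 100                        # hidden size and number of time steps
    g = rng.standard_normal(n)            # loss derivative signal at the final step

    # Generic recurrent matrix rescaled so its largest singular value is 0.95.
    W = rng.standard_normal((n, n))
    W *= 0.95 / np.linalg.norm(W, 2)      # matrix 2-norm = largest singular value

    # Orthogonal recurrent matrix: all singular values are exactly 1.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

    g_gen, g_orth = g.copy(), g.copy()
    for _ in range(T):                    # backpropagate through T linear steps
        g_gen = W.T @ g_gen               # norm shrinks by a factor of at most 0.95
        g_orth = Q.T @ g_orth             # norm is preserved exactly

    print(np.linalg.norm(g))              # original norm
    print(np.linalg.norm(g_gen))          # at most 0.95**100 (about 0.006) of the original
    print(np.linalg.norm(g_orth))         # unchanged after 100 steps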
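The abstract's reparameterization rests on the fact that any product of Givens (plane) rotations is orthogonal, so updating rotation angles keeps the recurrent matrix exactly norm-preserving. The helpers below are a minimal sketch of such a parameterization; the function names, the composition order, and the use of one angle per coordinate pair are assumptions for illustration, not the paper's implementation.

    import numpy as np

    def givens(n, i, j, theta):
        """Plane (Givens) rotation by angle theta in coordinates (i, j) of R^n."""
        G = np.eye(n)
        c, s = np.cos(theta), np.sin(theta)
        G[i, i] = G[j, j] = c
        G[i, j] = -s
        G[j, i] = s
        return G

    def rotation_from_angles(n, angles):
        """Compose one Givens rotation per coordinate pair into an orthogonal matrix.

        `angles` holds n*(n-1)/2 rotation angles, one per pair (i, j) with i < j.
        Any update to the angles keeps the product exactly orthogonal, so the
        resulting recurrent transformation never scales the backpropagated signal.
        """
        W = np.eye(n)
        k = 0
        for i in range(n):
            for j in range(i + 1, n):
                W = givens(n, i, j, angles[k]) @ W
                k += 1
        return W

    n = 5
    rng = np.random.default_rng(1)
    angles = rng.uniform(0, 2 * np.pi, size=n * (n - 1) // 2)
    W = rotation_from_angles(n, angles)
    print(np.allclose(W.T @ W, np.eye(n)))        # True: W is orthogonal
    print(np.linalg.svd(W, compute_uv=False))     # all singular values equal 1

Note that n(n−1)/2 angles are fewer than the n² entries of an unconstrained recurrent matrix, in line with the abstract's claim of a reduced parameter count.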
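The bound in Section 2.2 also motivates the abstract's choice of the absolute value non-linearity: its derivative has magnitude 1 wherever it is defined, so the element-wise product f′(x) ⊙ ∂L/∂y leaves the norm of the backpropagated signal unchanged, whereas a saturating non-linearity such as tanh can only shrink it. The comparison below is an illustrative sketch, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(2)
    x = 3.0 * rng.standard_normal(8)          # pre-activations
    g = rng.standard_normal(8)                # stands in for dL/dy

    # Element-wise backprop: dL/dx = f'(x) * dL/dy.
    abs_back = np.sign(x) * g                 # f(x) = |x|  ->  f'(x) = sign(x), magnitude 1
    tanh_back = (1.0 - np.tanh(x) ** 2) * g   # f(x) = tanh(x)  ->  f'(x) in (0, 1]

    print(np.linalg.norm(g))                  # original signal norm
    print(np.linalg.norm(abs_back))           # identical: |x| preserves the norm
    print(np.linalg.norm(tanh_back))          # smaller: saturated units shrink the signal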


Journal:
  • CoRR

Volume: abs/1612.04035

Pages: -

Publication date: 2016